Your goal is to forecast revenue, and you’d like to understand if their are relationships between Revenue trends / spikes and webpage traffic (do spikes in revenue relate to traffic being driven to specific pages?).
library(tidyverse)
library(timetk)
library(lubridate)
library(DataExplorer)
Read in the following data sets.
transactions_tbl <- read_rds("../00_data/transactions_weekly.rds")
transactions_tbl
## # A tibble: 91 x 2
## purchased_at revenue
## <date> <dbl>
## 1 2018-06-03 32736.
## 2 2018-06-10 38290.
## 3 2018-06-17 39974.
## 4 2018-06-24 35622.
## 5 2018-07-01 28984.
## 6 2018-07-08 47251.
## 7 2018-07-15 31512.
## 8 2018-07-22 50986.
## 9 2018-07-29 41183.
## 10 2018-08-05 36598.
## # ... with 81 more rows
google_analytics_by_page_tbl <- read_rds("../00_data/google_analytics_by_page_daily.rds")
google_analytics_by_page_tbl
## # A tibble: 4,685 x 6
## date pagePath pageViews organicSearches sessions rank
## <date> <chr> <dbl> <dbl> <dbl> <int>
## 1 2019-05-08 / 246 29 125 1
## 2 2019-05-08 /business/2016/08/07/Cus~ 38 27 32 11
## 3 2019-05-08 /business/2017/09/18/hr_~ 38 28 35 8
## 4 2019-05-08 /business/2017/10/16/sal~ 44 30 36 12
## 5 2019-05-08 /business/2017/12/27/six~ 28 14 22 9
## 6 2019-05-08 /business/2018/10/08/pyt~ 36 28 31 13
## 7 2019-05-08 /business/2019/03/11/ab-~ 20 15 16 10
## 8 2019-05-08 /learn.html 12 1 6 19
## 9 2019-05-08 /p/ds4b-101-r-business-a~ 8 1 5 14
## 10 2019-05-08 /p/jumpstart-with-r 71 5 28 6
## # ... with 4,675 more rows
transactions_tblplot_time_series() to visualize purchased at vs revenuetransactions_tbl %>%
plot_time_series(
.date_var = purchased_at
, .value = revenue
)
There are 20 pages x 3 Metrics (pageViews, organicSearch, and sessions). To make it easier to visualize, do this:
google_analytics_by_page_tblplot_time_series() with .interactive = FALSE and .facet_ncol = 4 (this helps to visualize all of the facets)google_analytics_by_page_tbl %>%
group_by(pagePath) %>%
plot_time_series(
.date_var = date
, .value = sessions
, .interactive = FALSE
, .facet_ncol = 4
)
We need to aggregate both data sets to get them on a common frequency before we can join them
transactions_tblsummarise_by_time() with .by = "week", and sum() the revenue.transactions_weekly_tbltransactions_weekly_tbl <- transactions_tbl %>%
summarise_by_time(
.date_var = purchased_at
, .by = "week"
, revenue = sum(revenue, na.rm = TRUE)
)
google_analytics_by_page_tblsummarise_by_time() with .by = "week", and sum() the sessions.pivot_wider() to pivot the names from “pagePath” and values from “sessions” to a wide data. Use names_prefix = "sessions_" to identify the new columns as coming from sessions.contains("/p/") (these are product pages)product_page_sessions_weekly_tblproduct_page_sessions_weekly_tbl <- google_analytics_by_page_tbl %>%
select(date, pagePath, sessions) %>%
group_by(pagePath) %>%
summarise_by_time(
.date_var = date
, .by = "week"
, sessions = sum(sessions, na.rm = TRUE)
) %>%
ungroup() %>%
pivot_wider(names_from = pagePath, values_from = sessions, names_prefix = "sessions_") %>%
select(date, contains("/p/"))
left_join() to join transactions_weekly_tbl and product_page_sessions_weekly_tbltransactions_product_page_sessions_weekly_tbltransactions_product_page_sessions_weekly_tbl <- left_join(
x = transactions_weekly_tbl
, y = product_page_sessions_weekly_tbl
, by = c("purchased_at"="date")
)
plot_missing() to inspect the missing data in transactions_product_page_sessions_weekly_tblplot_missing(transactions_product_page_sessions_weekly_tbl)
# Load Checkpoint Data
transactions_product_page_sessions_weekly_tbl <-
read_rds("challenge_01_data_checkpoints/transactions_product_page_sessions_weekly_tbl.rds")
transactions_product_page_sessions_weekly_tblpivot_longer() everything except “purchased_at” to form the data for plotting. This creates 2 columns, “name” and “value”.plot_time_series() to visualize “purchased_at” vs “value”. Use `.facet_ncol = 3# CODE HERE
Several of the columns have a lot of missing data. These are pages that did not exist until recently in the data, and unfortunately we aren’t going to be able to use them because they will result in data too few rows for the analysis.
transactions_product_page_sessions_weekly_tblends_with() “with-r/”. These columns have very low data.transactions_product_page_subset_tbl# CODE HERE
drop_na()mutate() and across():
log1p()standardize_vec()log_standardized_transactions_product_page_tbl# CODE HERE
Visualize cross correlations between revenue and anything that contains("session"):
log_standardized_transactions_product_page_tblplot_acf_diagnostics() with:
.ccf_vars = contains("session").show_ccf_vars_only = TRUE.facet_ncol = 2# CODE HERE
What can you say about the relationship between 101 Course Page visits & revenue?
What about the relationship between Jumpstart, Learning Labs PRO page visits and Revenue?